Customer Profile Analysis

Row

Product Feature

Promotion Feature

Row

Marketing Feature

General Analysis

average_customer
Year_Birth 1968.917
Income 52236.58
Kidhome 0.441934
Teenhome 0.5056484
Dt_Customer 2013-07-10
Recency 49.00768
MntWines 305.1536
MntFruits 26.32399
MntMeatProducts 166.9625
MntFishProducts 37.63534
MntSweetProducts 27.03479
MntGoldProds 43.91143
NumDealsPurchases 2.32535
NumWebPurchases 4.087664
NumCatalogPurchases 2.671487
NumStorePurchases 5.805242
NumWebVisitsMonth 5.321735
AcceptedCmp3 0.07365567
AcceptedCmp4 0.07410755
AcceptedCmp5 0.07275192
AcceptedCmp1 0.06416629
AcceptedCmp2 0.01355626
Complain 0.009037506
Z_CostContact 3
Z_Revenue 11
Response 0.1504745
Age 52.08269
education Graduation
status_marital Married

Correlation Analysis

Column

Heat Map

Column

Correlation Plot

Correlation Funnel

Customer segmentation

Row

Segmentation by Response

Education group by Marital Status

Row

Cluster segmentation by clustering

Customers Segmentation

Cluster 1 - Low Value Customers Cluster 2 - High Value Customers
  • Low or average level of income

  • The majority has one kid or teenager at home

  • Represents the most part of basic level of education

  • Low number of purchases through store purchase. They prefer web purchases or make catalog purchases

  • Negative effect of having kids and teens on advertising campaign acceptance

  • High level of income

  • Meat and wine are preferred

  • The majority has no children

  • Low web visit and high store purchase

  • Number of store purchases decreases when there are kids

  • Selection of wines and fruits, as well as the attractive deals attract customers with higher income

Predictive Model

Column

Machine Learning Model Random Forest

Bootstrapping

Model Accuracy

0.8860759

ROC Curve and AUC

0.9001666

Number of trees

500

Sample size

1660

Model Analysis

  • Among the most important variables for the proposed machine learning model, the number of days since the last purchase (‘Recency’) is very important.

  • Because purchasing in store, on the web, or via the catalog (‘NumStorePurchases’, ‘NumWebPurchases’, ‘NumCatalogPurchases’) is positively correlated with ‘Income’. Eventually, these variables become significant.

  • The history of past campaigns are also crucial factors to this prediction.

  • Nevertheless, variables such as ‘Complain’,‘Age’ of the customer or ‘Marital_Status’ are not so crucial for the model.

Column

ROC Curve and AUC

Feature Importance


  1. ↩︎

---
title: 'R Notebook: iFood CRM Data Analyst Case'
author: "Eduar Felipe Riaño Torres^[felipehuman@gmail.com]"
date: "`r Sys.Date()`"
output: 
  flexdashboard::flex_dashboard:
    #orientation: rows
    social: menu
    source_code: embed
    theme: journal
---

```{r setup, include=FALSE}
# summarization
library(skimr)
library(Hmisc)
library(knitr) # some tables and R Markdown
library(correlationfunnel) # correlation Analysis

# articulation with Python
library(reticulate)

# general visualization
library(ggplot2) # visualization
library(scales) # visualization
library(grid) # visualization
library(gridExtra) # visualization
library(RColorBrewer) # visualization
library(corrplot) # visualization
library(reshape2) # visualization
library(hrbrthemes) # visualization

# general data manipulation
library(dplyr) # data manipulation
library(readr) # input/output
library(data.table) # data manipulation
library(tibble) # data wrangling
library(tidyr) # data wrangling
library(stringr) # string manipulation
library(forcats) # factor manipulation
library(tidyverse) # plotting, cleaning, etc
library(gdata)
library(plyr)

# specific visualization
library(alluvial) # visualization
library(ggrepel) # visualization
library(ggforce) # visualization
library(ggridges) # visualization
library(gganimate) # animations
library(gridExtra) # visualization
library(GGally) # visualization
library(ggExtra) # visualization
library(highcharter) # visualization
library(countrycode) # visualization
library(geofacet) # visualization
library(wesanderson) # color palettes
library(treemapify) # visualization
library(cluster) # visualization
library(gridExtra) # visualization
library(grid) # visualization
library(plotly)
library(magick)

# specific data manipulation
library(lazyeval) # data wrangling
library(broom) # data wrangling
library(purrr) # string manipulation
library(reshape2) # data wrangling
library(rlang) # encoding

# maps / geospatial
library(geosphere) # geospatial locations
library(leaflet) # maps
library(leaflet.extras) # maps
library(maps) # maps

# text / NLP
library(tidytext) # text analysis
library(tm) # text analysis
library(SnowballC) # text analysis
library(topicmodels) # text analysis
library(wordcloud) # test visualization

# analysis
library(lubridate)
library(tidyverse)
library(caret)
library(xgboost)
library(modeest)
library(NbClust)
library(factoextra)
library(tidymodels) # framework for ML
library(ranger)
library(randomForest)
library(vip)
library(DataExplorer)
library(tidyquant)
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard)
library(knitr)
library(heatmaply)
library(corrplot)



# Make some noisily increasing data
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each=10),
                  xvar = 1:20 + rnorm(20,sd=3),
                  yvar = 1:20 + rnorm(20,sd=3))

setwd("C:/Users/eduar/Downloads/My Things/Data sets R/iFood")

raw_ifood <- read.csv("ml_project1_data.csv")
raw_ifood <- raw_ifood %>% filter(.,!is.na(Income))
raw_ifood$Income <- as.numeric(raw_ifood$Income %>% gsub("[$,]","",.))
current_year = 2021 
raw_ifood <- mutate(raw_ifood, Age = current_year - raw_ifood$Year_Birth)
raw_ifood <- raw_ifood %>% 
             filter(.,Age<100)
raw_ifood$Dt_Customer <- ymd(raw_ifood$Dt_Customer)

data_numeric <- raw_ifood %>% select_if(., is.numeric) %>% select(-c("AcceptedCmp1","AcceptedCmp2",
                                                                     "AcceptedCmp3","AcceptedCmp4",
                                                                     "AcceptedCmp5","Recency",
                                                                     "Complain","ID","Z_CostContact",
                                                                     "Z_Revenue","Year_Birth"))


# Discrete variables
people_discrete <- raw_ifood %>% select(c('Education','Marital_Status',
                                          'Kidhome','Teenhome','Complain'))

# Continuous variables
people_continuous <- raw_ifood %>% select(c('Year_Birth','Income',
                                            'Dt_Customer','Recency'))


data_products <- raw_ifood %>% select(c('MntWines','MntFruits',
                                        'MntMeatProducts',
                                        'MntFishProducts',
                                        'MntSweetProducts',
                                        'MntGoldProds'))   

data_place <- raw_ifood %>% select(c('NumWebPurchases','NumCatalogPurchases',
                                     'NumStorePurchases','NumWebVisitsMonth'))

data_promotion <- raw_ifood %>% select(c('AcceptedCmp1','AcceptedCmp2','AcceptedCmp3',
                                         'AcceptedCmp4',
                                         'AcceptedCmp5',
                                         'Response',
                                         'NumDealsPurchases'))

product_key_value     <- gather(data_products)
place_key_value       <- gather(data_place)
people_key_value_disc <- gather(people_discrete)
people_key_value_cont <- gather(people_continuous)
promotion_key_value   <- gather(data_promotion)


# Counting ones
campaign_takeup <- raw_ifood %>% 
                   select('AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 
                          'AcceptedCmp4','AcceptedCmp5', 'Response') %>% 
                   colSums()
# Counting zeros
zero     <- function(x) sum(x == 0)
campaign <- raw_ifood %>% 
            select('AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 
                   'AcceptedCmp4','AcceptedCmp5', 'Response')
rechazo  <- numcolwise(zero)(campaign)
rechazo  <- as.data.frame(t(as.matrix(rechazo)))
colnames(rechazo) <- "no_aceptacion"

# Making a data frame with Acceptance and Rejected campaigns 
campaign_takeup <- data.frame(campana = c('AcceptedCmp1', 'AcceptedCmp2', 
                                          'AcceptedCmp3', 
                                          'AcceptedCmp4',
                                          'AcceptedCmp5', 
                                          'Response'),
                              aceptacion    = campaign_takeup[1:6],
                              no_aceptacion = rechazo)

ifood_df_clustering = raw_ifood %>% 
                      select(-ID,-Education,-Marital_Status,-Dt_Customer,
                             -Year_Birth,-Z_CostContact,-Z_Revenue,-Response)

k2 <- kmeans(scale(ifood_df_clustering), 2,  iter.max = 100, 
             nstart = 50, algorithm = "Lloyd")
```

Customer Profile Analysis
=======================================================================

Row
-------------------------------------
    
### Product Feature
    
```{r}
# Plotting bar plots for product variables
p1 <- ggplot(product_key_value, aes(value))                                 +
geom_histogram(alpha=.3, colour = "brown3", fill = "firebrick1", notch = TRUE) +
facet_wrap(~key, scales = 'free_x', ncol = 3)                         +
labs(title = "Boxplot of Product Variables")                          +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p1)
```

### Promotion Feature
    
```{r}
# Plotting bar-plots for promotion variables
p2 <- ggplot(promotion_key_value, aes(value)) +
geom_histogram(stat = 'count', colour = "brown3", fill = "firebrick1", 
               position="identity",alpha=.3) +
facet_wrap(~key, scales = 'free_x', ncol = 3) +
labs(title = "Bar Plots of Promotion Variables") +
theme(plot.title = element_text(hjust = 0.5))
            
ggplotly(p2)
```
   
Row
-------------------------------------
    
### Marketing Feature
    
```{r}
campaign_takeup_long <- melt(campaign_takeup)

p3 <- ggplot(campaign_takeup_long, aes(x = campana, y = value, 
                                 fill = variable))+
      geom_bar(stat = "identity", position= "dodge", 
      fill = c("#D20000"), alpha=.50,show.legend = NA) +
      theme_bw() +
      labs(x = "Campaign", y = "Acceptance", 
      title="Acceptance of Marketing Campaigns") + 
      coord_flip() +
      scale_y_continuous(labels = scales::comma)

ggplotly(p3)
```
    
### General Analysis

```{r}
average_customer_num <- raw_ifood %>% 
                        select_if(names(.)=="Dt_Customer" | sapply(., is.numeric)) %>% 
                        select(-ID) %>% 
                        summarise_each(funs(mean)) %>%
                        t() %>% 
                        as.data.frame() %>% 
                        format(scientific = F, digits = 2) %>% 
                        setnames("V1", "average_customer")
                        
education      <- mlv(raw_ifood$Education, method="mfv")
status_marital <- mlv(raw_ifood$Marital_Status, method="mfv")
categ          <- data.frame(education,status_marital)

average_customer_categ <- categ %>%
                          t() %>% 
                          as.data.frame() %>% 
                          setnames("V1", "average_customer")

  
average_customer <- rbind(average_customer_num, average_customer_categ)
knitr::kable(average_customer)
```

Correlation Analysis
=======================================================================

Column {data-width=550}
-------------------------------------

### Heat Map

```{r}
colores <- colorRampPalette(c("dodgerblue", "ghostwhite", "firebrick2"))(20)
crr     <- cor(data_numeric, use="complete.obs")

heatmaply_cor(crr,
  xlab = "Features",
  ylab = "Features",
  k_col = 2,
  k_row = 2
)
```

Column {data-width=450}
-------------------------------------
   
### Correlation Plot

```{r, fig.height=6}
corrplot(corr = crr, 
         method="number", 
         col = colores, 
         type="upper", 
         tl.col="black", 
         order="hclust",
         number.cex = 0.71)
```   
 
### Correlation Funnel
    
```{r}
customer_ifoof_binarized <- raw_ifood %>%
                            select(-ID,-Year_Birth,-Z_CostContact,-Z_Revenue, -Dt_Customer) %>%
                            binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)

customer_response_corr  <- customer_ifoof_binarized %>%
                           correlate(Response__1)

customer_response_corr %>%
  plot_correlation_funnel()
```

Customer segmentation
=======================================================================

Row
-----------------------------------------------------------------------

### Segmentation by Response

```{r}
p5 <- raw_ifood %>% 
  ggplot(aes(x = Age, y = Income, color = Teenhome, size = Recency)) +
  geom_point(alpha = 0.25) +
   labs(title = "Customers response according to their age and income") +
  facet_wrap(~Response) +
  scale_y_continuous(labels = scales::comma)
ggplotly(p5)
```


### Education group by Marital Status

```{r}
graph_3 <- ggplot(raw_ifood, aes(x = Education, fill= Marital_Status)) +
           geom_bar(position = position_fill(), alpha=.60) +
           scale_fill_manual(values = c("firebrick4","darkred","#C00000","#FF3334",
                                        "#FF6F77","#FFBBC1","#FFDEE3","#FF8896"))
ggplotly(graph_3)
```

Row
-----------------------------------------------------------------------

### Cluster segmentation by clustering

```{r}
set.seed(15)
clus <- ggplot(raw_ifood, aes(x = Age, y = Income, size = Teenhome)) + 
  geom_point(alpha = 0.25, stat = "identity", 
             aes(color = as.factor(k2$cluster))) +
  scale_color_discrete(name=" ",
              breaks=c("1", "2"),
              labels=c("Cluster 1", "Cluster 2")) +
  ggtitle("Segments of Customers", 
          subtitle = "Using K-means Clustering")  +
  scale_y_continuous(labels = scales::comma)
ggplotly(clus)
```

### Customers Segmentation

+---------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Cluster 1 - **Low Value Customers**                                                                     | Cluster 2 - **High Value Customers**                                                                    |
+=========================================================================================================+=========================================================================================================+
| -   Low or average level of income                                                                      | -   High level of income                                                                                |
|                                                                                                         |                                                                                                         |
| -   The majority has one kid or teenager at home                                                        | -   Meat and wine are preferred                                                                         |
|                                                                                                         |                                                                                                         |
| -   Represents the most part of basic level of education                                                | -   The majority has no children                                                                        |
|                                                                                                         |                                                                                                         |
| -   Low number of purchases through store purchase. They prefer web purchases or make catalog purchases | -   Low web visit and high store purchase                                                               |
|                                                                                                         |                                                                                                         |
| -   Negative effect of having kids and teens on advertising campaign acceptance                         | -   Number of store purchases decreases when there are kids                                             |
|                                                                                                         |                                                                                                         |
|                                                                                                         | -   Selection of wines and fruits, as well as the attractive deals attract customers with higher income |
+---------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

Predictive Model
=======================================================================

Column
-------------------------------------

### Machine Learning Model Random Forest

```{r}
tipo_modelo <- "Bootstrapping" 
valueBox(tipo_modelo, 
         icon = "fa-dice")
```

### Model Accuracy

```{r}
exactitud <- 0.8860759
valueBox(exactitud, icon = "fa-compass")
```

### ROC Curve and AUC

```{r}
roc_auc <- 0.9001666	
valueBox(roc_auc, icon = "fa-bar-chart")
```

### Number of trees

```{r}
number_trees <- 500 
valueBox(number_trees, 
         icon = "fa-leaf")
```

### Sample size

```{r}
tamano_mues <- 1660
valueBox(tamano_mues, 
         icon = "fa-pie-chart")
```

### Model Analysis

-   Among the most important variables for the proposed machine learning model, the number of days since the last purchase ('Recency') is very important.

-   Because purchasing in store, on the web, or via the catalog ('NumStorePurchases', 'NumWebPurchases', 'NumCatalogPurchases') is positively correlated with 'Income'. Eventually, these variables become significant.

-   The history of past campaigns are also crucial factors to this prediction.

-   Nevertheless, variables such as 'Complain','Age' of the customer or 'Marital_Status' are not so crucial for the model.

Column
-----------------------------------------------------------------------

### ROC Curve and AUC

```{r}
curva_roc <- image_read('C:/Users/eduar/Downloads/My Things/Data sets R/iFood/curvaroc.jpg')
curva_roc 
```

### Feature Importance

```{r}
importancia <- image_read('C:/Users/eduar/Downloads/My Things/Data sets R/iFood/feature_importance.jpg')
importancia
```